Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to remember values over arbitrary time intervals. Unlike feedforward networks, RNNs have feedback (recurrent) connections, so their output at each time step depends on an internal state carried over from previous steps.
An LSTM network contains LSTM units instead of, or in addition to, other network units. An LSTM unit can remember values over both long and short time periods. The key to this ability is its cell state, which is updated additively through multiplicative gates rather than being rewritten through an activation function at every step. Because the stored value can pass through many steps essentially unchanged, the gradient does not tend to vanish when the network is trained with backpropagation through time.
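To make the gating concrete, here is a minimal numpy sketch of a single LSTM step. It is an illustration only, not the Keras implementation, and the stacked weight layout (W, U, b holding the four transforms) is an assumption made for brevity. The additive update of the cell state c is what lets gradients flow across long time spans.
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # Hypothetical layout: W is (4d x input_dim), U is (4d x d), b is (4d,);
    # they stack the input, forget, output and candidate transforms.
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:d])        # input gate: how much new information to write
    f = sigmoid(z[d:2*d])     # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2*d:3*d])   # output gate: how much of the cell state to expose
    g = np.tanh(z[3*d:])      # candidate values to write
    c = f * c_prev + i * g    # additive cell-state update (no repeated squashing)
    h = o * np.tanh(c)        # hidden state passed on to the next step / layer
    return h, c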
0 - negative
1 - positive
Trains an LSTM model on the IMDB sentiment classification task. The dataset is actually too small for LSTM to be of any advantage compared to simpler, much faster methods such as TF-IDF + LogReg.
Notes:
In [48]:
from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
import numpy as np
max_features = 20000
maxlen = 80 # cut texts after this number of words (among top max_features most common words)
batch_size = 200
Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
Arguments:
path: if you do not have the data locally (at '~/.keras/datasets/' + path), it will be downloaded to this location.
num_words: integer or None. Top most frequent words to consider. Any less frequent word will appear as oov_char value in the sequence data.
skip_top: integer. Top most frequent words to ignore (they will appear as oov_char value in the sequence data).
maxlen: int. Maximum sequence length. Any longer sequence will be truncated.
seed: int. Seed for reproducible data shuffling.
start_char: int. The start of a sequence will be marked with this character. Set to 1 because 0 is usually the padding character.
oov_char: int. Words that were cut out because of the num_words or skip_top limit will be replaced with this character.
index_from: int. Index actual words with this index and higher.
Source: https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification
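As an illustrative aside (not from the original notebook), the encoding can be inverted with imdb.get_word_index(); shifting the indices by 3 accounts for the reserved padding / start / OOV values described above.
In [ ]:
# Sketch: decode one encoded review back into words (assumes the default index_from=3).
from keras.datasets import imdb

(x_tr, _), _ = imdb.load_data(num_words=10000)
word_index = imdb.get_word_index()                      # word -> index, starting at 1
index_word = {i + 3: w for w, i in word_index.items()}  # undo the index_from=3 offset
index_word.update({0: '<pad>', 1: '<start>', 2: '<oov>'})
print(' '.join(index_word.get(i, '<oov>') for i in x_tr[0][:20]))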
In [ ]:
'''
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
'''
In [38]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
In [41]:
print(X_train.shape, X_test.shape)
In [21]:
X_train
Out[21]:
In [57]:
# lengths of the first three reviews before padding (they vary)
for i in range(0, 3):
    print(len(X_train[i]))
Pad each training sample to the same length: sequences longer than maxlen are truncated and shorter ones are zero-padded.
In [58]:
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
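As a quick illustration of what pad_sequences does (a toy batch, not the IMDB data): by default, shorter sequences are padded with zeros at the front and longer sequences are truncated from the front.
In [ ]:
toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(sequence.pad_sequences(toy, maxlen=4))
# [[0 1 2 3]   <- zero-padded at the front (padding='pre' is the default)
#  [6 7 8 9]]  <- truncated from the front (truncating='pre' is the default)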
In [23]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
In [24]:
X_train
Out[24]:
In [60]:
# after padding, every review has length maxlen
for i in range(0, 3):
    print(len(X_train[i]))
In [25]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2)) # 128 = dimensionality of the output space
model.add(Dense(1, activation='sigmoid'))
# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
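To sanity-check the architecture, model.summary() prints the layer output shapes and parameter counts. The rough arithmetic below is an aside added for illustration, not output from the original notebook.
In [ ]:
# Expected parameter counts (assuming max_features = 20000 and 128-dimensional layers):
#   Embedding: 20000 * 128                    = 2,560,000
#   LSTM:      4 * (128*128 + 128*128 + 128)  =   131,584
#   Dense:     128 + 1                        =       129
model.summary()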
In [31]:
print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=3,
          validation_data=(X_test, y_test))
Out[31]:
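Three epochs keep overfitting in check here, but for longer runs a common refinement (not used in the run above) is to stop when the validation loss stops improving. A sketch using Keras's EarlyStopping callback, with the epoch count chosen arbitrarily for illustration:
In [ ]:
from keras.callbacks import EarlyStopping

# Hypothetical longer run: stop once val_loss has not improved for 2 consecutive epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=2)
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(X_test, y_test),
          callbacks=[early_stop])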
In [32]:
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
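Finally, the trained model can be used for inference: model.predict returns the sigmoid output, a probability of the positive class, which can be thresholded at 0.5. A short illustrative check on a few test reviews:
In [ ]:
# Sketch: predict sentiment for the first five test reviews and compare with the labels.
probs = model.predict(X_test[:5])
for p, y in zip(probs.ravel(), y_test[:5]):
    print('p(positive) = %.3f  ->  predicted %d, actual %d' % (p, int(p > 0.5), y))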